# Distribution map using survey data

An example of Starry ray in the North Sea, using INLA

Authors: Geert Aarts, Niels Hintzen, Harriet van Overzee, Jurgen Batsleer, Jan Jaap Poos, Ingrid Tulp

Most of the data we collect at sea is spatially and temporally correlated, such as the data from the International Bottom Trawl Survey (IBTS) or the Beam Trawl Survey (BTS). Often we want to infer temporal or spatial trends in these data.

The statistical package INLA has the advantage over other R software that it can combine spatial, temporal, zero-inflated, and random-effects models in a single framework. It is therefore very powerful and may become your one-stop package.

This document shows some examples of analyzing survey data using INLA. The first model estimates the number of individuals per haul, taking into account haul location (assuming spatial autocorrelation) and year (as a 'fixed effect'). Differences in counts between the two surveys are accounted for by also including survey as a fixed effect. Subsequent models take length into account in determining the counts. In the final part, the counts are expressed per unit area, so population estimates can be made using a 'swept area' approach.
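To give a flavour of what such a model looks like in INLA, below is a minimal sketch of the model structure described above: counts per haul with a spatial random field, and year and survey as fixed effects. The mesh settings and the object names (`mesh`, `spde`, `stack`, `spatial`) are illustrative assumptions, not the code used later in this document.

```r
# Sketch only: assumes full_cpue with columns Xkm, Ykm, NoPerHaul, Year, Survey
library(INLA)

# Triangulate the study area (max.edge values are placeholder choices)
mesh <- inla.mesh.2d(loc = cbind(full_cpue$Xkm, full_cpue$Ykm),
                     max.edge = c(50, 100))
spde <- inla.spde2.matern(mesh)                 # Matern spatial random field
A    <- inla.spde.make.A(mesh,
                         loc = cbind(full_cpue$Xkm, full_cpue$Ykm))

# Stack links the observations to the mesh nodes and the covariates
stack <- inla.stack(data    = list(count = full_cpue$NoPerHaul),
                    A       = list(A, 1),
                    effects = list(spatial = 1:spde$n.spde,
                                   data.frame(Year   = full_cpue$Year,
                                              Survey = full_cpue$Survey)))

# Counts with year and survey as fixed effects plus a spatial field
fit <- inla(count ~ factor(Year) + Survey + f(spatial, model = spde),
            family = "poisson",
            data = inla.stack.data(stack),
            control.predictor = list(A = inla.stack.A(stack)))
```

The term f(spatial, model = spde) captures the spatial autocorrelation between hauls, while factor(Year) and Survey absorb systematic differences between years and between the two surveys.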
## Preparing the R environment

First, we need to set the path where the data is located and to load the relevant libraries. The INLA package can be installed using install.packages("INLA", repos="https://inla.r-inla-download.org/R/stable"). We need the INLA package, but also a number of packages for spatial analyses ('rgeos' and 'rgdal') and for plotting the data and the results ('lattice', 'latticeExtra', 'grid', 'gridExtra', 'mapdata', and 'maptools').

path <- "~/WMR-INLA/data"
#path <- "d://WMR-INLA/data" 

library(INLA); library(fields); 
library(mgcv)
library(lattice); library(latticeExtra); library(grid); library(gridExtra);
library(rgdal); library(rgeos); 
library(mapdata); library(maptools)

## The IBTS data

We will use the BTS and IBTS data sets. In this case we downloaded the data for starry ray. These data sets contain the CPUE per length per haul and can be downloaded from datras.ices.dk.

First we read in the IBTS data set. From these data, the hauls in the North Sea are selected by keeping only those in roundfish areas 1-7.

IBTS <- read.csv(file.path(path,"CPUE per length per haul_2017-07-12 11_18_36.csv"), stringsAsFactors = F)
IBTS <- IBTS[IBTS$Area <= 7,]

The lengths are stored in mm and the CPUE as numbers per hour. "Zero hauls" (hauls where no individuals were caught) are included.

#convert CPUE per hour to a count per haul using the haul duration
IBTS$NoPerHaul <- round(IBTS$CPUE_number_per_hour*(IBTS$HaulDur/60))

IBTS <- within(IBTS, haulCode <-  paste0(Year, Survey, Quarter, Ship, HaulNo, ShootLat, ShootLong))

To get swept-area information for the IBTS we need to read in the HH records of the IBTS. These HH records are available in the IBTS exchange files. The swept areas can be calculated from the travelled distance and the door spread or wing spread. Missing values for these variables are coded as -9; in the R code those are turned into NAs. We make a haul identifier for the hauls like we did for the count data.

IBTSHH <- read.csv(file.path(path,"Exchange Data_2017-07-14 09_26_25.csv"), stringsAsFactors = F)

IBTSHH <- within(IBTSHH, haulCode <-  paste0(Year, Survey, Quarter, Ship, HaulNo, ShootLat, ShootLong))

IBTSHH[IBTSHH$DoorSpread == -9,]$DoorSpread <- NA
IBTSHH[IBTSHH$Distance == -9,]$Distance <- NA
IBTSHH[IBTSHH$WingSpread == -9,]$WingSpread <- NA

Next, we plot a histogram of haul duration, a scatterplot of distance versus haul duration, and a scatterplot of door spread versus wing spread to get a look at the data. Clearly, haul duration is stored in minutes, with most hauls being either 30 minutes or 1 hour. Hauls longer than 80 minutes are removed because these contain a large proportion of outliers with respect to e.g. distance. This step removes 36 of the 27703 hauls.

The wing spread and door spread are stored in meters, with wing spreads in the range 7-50 meters and door spreads in the range 1-179 meters. There is one clear outlier.

par(mfrow=c(1,3))
hist(IBTSHH$HaulDur,120, xlim=c(0,120), main= "", xlab="Haul duration (minutes)")
abline(v=80, lty=2)
IBTSHH <- IBTSHH[IBTSHH$HaulDur <= 80,]

plot(IBTSHH$HaulDur,IBTSHH$Distance, pch=20, xlab="Haul duration (minutes)", ylab= "Distance (m)")

plot(IBTSHH$WingSpread,IBTSHH$DoorSpread, pch=20, xlab="Wingspread (m)", ylab= "Doorspread (m)")
abline(a=0,b=1, lty=2)

A surface is calculated by multiplying wing spread by distance. Both are in meters, so the resulting surface is in m2; it is converted to km2 by dividing by 1e+06.

IBTSHH$surface <- (IBTSHH$Distance * IBTSHH$WingSpread)/1e+06

Next, only relevant variables are selected.

IBTSHH <- IBTSHH[names(IBTSHH) %in% c("haulCode", "surface", "HaulDur","DoorSpread","Distance","WingSpread")]

Plotting the surface against haul duration and fitting a simple linear model (without intercept), we hope to find a relationship.

plot(x=IBTSHH$HaulDur,y=IBTSHH$surface, pch=20, main= "IBTS", xlab= "Haul duration ( minutes)", ylab="Surface (km2)", xlim=c(0,80))
linmod <- lm(surface~ -1 + HaulDur, data=IBTSHH)
summary(linmod)
## 
## Call:
## lm(formula = surface ~ -1 + HaulDur, data = IBTSHH)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.041650 -0.006689 -0.000587  0.005138  0.116422 
## 
## Coefficients:
##          Estimate Std. Error t value Pr(>|t|)    
## HaulDur 2.293e-03  4.171e-06   549.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01043 on 6494 degrees of freedom
##   (21172 observations deleted due to missingness)
## Multiple R-squared:  0.979,  Adjusted R-squared:  0.979 
## F-statistic: 3.022e+05 on 1 and 6494 DF,  p-value: < 2.2e-16
abline(linmod, lty=2)

Indeed, there is a good relationship: each minute of haul duration adds coef(linmod) (about 0.0023) km2 of haul surface for the IBTS. This relationship is used to calculate the surface of hauls where only haul duration is known.

IBTSHH[is.na(IBTSHH$surface),]$surface <- as.numeric(predict(linmod,newdata=IBTSHH))[is.na(IBTSHH$surface)]

Merge the haul information into the full CPUE set, after removing haul duration from the haul data (because it is already present in the CPUE set).

IBTSHH$HaulDur <- NULL
IBTS <- merge(IBTS, IBTSHH,by= "haulCode", all.x=T, all.y=F)

After this merge, there are nrow(IBTS[is.na(IBTS$surface),]) observations in the IBTS data set that do not have a surface estimate, because their haul duration was larger than the threshold used in our calculations.

## The BTS data

We have a similar but slightly different data set for the BTS. Here, the "zero hauls" have length class NA and number per haul NA. These observations are set to zero, so that we have the same structure as in the IBTS data set.

BTS  <- read.csv(file.path(path,"CPUE per length per Hour and Swept Area_2017-07-12 11_29_36.csv"), stringsAsFactors = F)

BTS[is.na(BTS$LngtClass),]$NoPerHaul <- 0
BTS[is.na(BTS$LngtClass),]$LngtClass <- 0

Which vessels and years are present in the BTS set? Note that there are some years in the BTS ISIS series where the zero hauls are not added because no specimens were caught in that year (e.g. 2010). That should be fixed at ICES. Alternatively, we could download the data for a more abundant species and construct our own zero hauls.

table(BTS$Ship, BTS$Year)
##       
##        1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
##   ISI    64   81   91  119  108  107  101   93   91   88    0    0   96    0   70   88   86   82   99   77    0    0
##   SOL     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0   68   60    0    0    0    0    0
##   SOL2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0   58   45    0  272  238
##   TRI2    0    0    0    0    0    0    0    0    0  283  257  409  545  632  517  668  522  514  533  418  406  391
##       
##        2009 2010 2011 2012 2013 2014 2015 2016
##   ISI    96    0    0   87   79   69   82   94
##   SOL     0    0    0    0    0    0    0    0
##   SOL2  229  208  188  296  258   45  194  227
##   TRI2  360  303  405  402  384  356  280  292

The BTS swept-area estimate is rounded, and thus not very useful. That's why we recalculate it: the distance and beam width are in m, so the surface is in m2, which is divided by 1e+06 to get km2.

# there are some negative distances. We make those  NA, and infer their surface later 
BTS[!is.na(BTS$Distance) & BTS$Distance < 0,]$Distance <- NA

#now calculate surface
BTS$surface <- (BTS$Distance * BTS$BeamWidth)/1e+06

# summary of the surface gives number of NAs, and whether there are negative values
summary(BTS$surface)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0044  0.0296  0.0312  0.0347  0.0336  0.0858     569
summary(BTS$HaulDur)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   30.00   30.00   34.39   30.00   60.00

We use the same procedure of inferring haul surface from haul duration for the missing surfaces of the BTS as we did for the IBTS.

plot(x=BTS$HaulDur,y=BTS$surface, pch=20, main= "BTS", xlab="Haul duration (minutes)", ylab="Surface (km2)", xlim=c(0,80))
linmod <- lm(surface~ -1 + HaulDur, data=BTS)
summary(linmod)
## 
## Call:
## lm(formula = surface ~ -1 + HaulDur, data = BTS)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0201602 -0.0008962  0.0005198  0.0018958  0.0247837 
## 
## Coefficients:
##          Estimate Std. Error t value Pr(>|t|)    
## HaulDur 1.018e-03  9.201e-07    1106   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.003752 on 12741 degrees of freedom
##   (569 observations deleted due to missingness)
## Multiple R-squared:  0.9897, Adjusted R-squared:  0.9897 
## F-statistic: 1.223e+06 on 1 and 12741 DF,  p-value: < 2.2e-16
abline(linmod, lty=2)

Indeed, there is a good relationship: each minute of haul duration adds coef(linmod) (about 0.0010) km2 of haul surface for the BTS. This relationship is used to calculate the surface of hauls where only haul duration is known.

BTS[is.na(BTS$surface),]$surface <- as.numeric(predict(linmod,newdata=BTS))[is.na(BTS$surface)]

Next, we make a unique haul code for the BTS, as we did for the IBTS.

#make haul code
BTS <- within(BTS, haulCode <- paste0(Year, Survey, Quarter, Ship, HaulNo, ShootLat, ShootLong))

The BTS and IBTS data sets need to be combined into one, using rbind() to combine them by rows. They can only be combined if they have the same columns; the columns shared by the two data sets are found using intersect().

cols <- intersect(colnames(BTS), colnames(IBTS))
cpue <- rbind(BTS[,cols],IBTS[,cols])

#remove CPUE per hour that we do not need now that we have counts
cpue <- cpue[,!names(cpue)=="CPUE_number_per_hour"]

Now that the IBTS and BTS data sets are combined, we want to make a set in which we have counts for all hauls and all lengths. This means first making an identifier for each unique haul (based on the information we have for all the hauls). This identifier is used to make a "trawllist" where all the information for the hauls is stored together with its unique identifier.

Once the trawllist is made, we use expand.grid() to make all combinations of hauls and length classes. This set is merged with our original data set.

trawllist <- cpue[!duplicated(cpue$haulCode),!names(cpue) %in% c("Species","AphiaID","NoPerHaul","Sex", "LngtClass")]

#expand grid 
hxl <- expand.grid(haulCode=unique(cpue$haulCode),LngtClass=unique(cpue$LngtClass), stringsAsFactors = F)
full_cpue <- merge(hxl,cpue[names(cpue) %in% c("haulCode", "LngtClass","NoPerHaul")], all.x=T)
rm(hxl)

After merging all possible combinations with the data, we now have NAs for those lengths and hauls where the catch is zero, so we set those to zero. These data are subsequently merged with the trawllist so that we have all information together.

full_cpue[is.na(full_cpue$NoPerHaul),]$NoPerHaul <- 0
full_cpue  <- merge(full_cpue,trawllist, all.x=T)

The records with length equal to zero (which indicated zero hauls in our original set) can now be removed, because we have all the information we need (these hauls now have zero catch for the full length range observed in the survey).

#now remove zero lengths
full_cpue <- full_cpue[full_cpue$LngtClass > 0,]

In addition, there are some observations that are highly unlikely. For instance, there is a single observation of an individual of 100 cm (in 1977). This is highly suspicious because it is far larger than any other observation, and likely due to species misidentification. This can be seen in the histogram of length observations below.

par(mfrow=c(1,2))
len_cutoff <- 990
hist(full_cpue[full_cpue$Quarter==1 & full_cpue$NoPerHaul >0,]$LngtClass, breaks=100, xlab="Length (mm)", main="Observed lengths q1",xlim=c(0, max(full_cpue$LngtClass)), ylim=c(0,1500))
abline(v=len_cutoff, col="red")
grid()
lines(aggregate(NoPerHaul~LngtClass, data=full_cpue[full_cpue$Quarter==1 & full_cpue$NoPerHaul >0,], FUN="sum"))

hist(full_cpue[full_cpue$Quarter==3 & full_cpue$NoPerHaul >0,]$LngtClass, breaks=100, xlab="Length (mm)", main="Observed lengths q3",xlim=c(0, max(full_cpue$LngtClass)), ylim=c(0,1500))
abline(v=len_cutoff, col="red")
grid()
lines(aggregate(NoPerHaul~LngtClass, data=full_cpue[full_cpue$Quarter==3 & full_cpue$NoPerHaul >0,], FUN="sum"))

Ideally, we would go back to the data and the people who collected them to see whether the correctness of the data can be confirmed. In this case there is no possibility to go that far back in time. We remove these observations from the data by selecting only observations with length < 990 mm. One could even consider removing the observations of individuals larger than 750 mm: some of these were recorded in the southern North Sea, far from where most of the catches were made, and are likely species misidentifications, e.g. of R. clavata. Note that when the length information is included in the analyses, a further selection is made on the lengths.

#remove single unlikely large individual (of 1 m length) 
full_cpue <- full_cpue[full_cpue$LngtClass < len_cutoff,]
head(full_cpue)
##                   haulCode LngtClass NoPerHaul  Survey Quarter Ship Gear HaulNo Year HaulDur DayNight ShootLat
## 2 1966NS-IBTS1AND1054.55.5        10         0 NS-IBTS       1  AND  H18     10 1966      60        D     54.5
## 3 1966NS-IBTS1AND1054.55.5        20         0 NS-IBTS       1  AND  H18     10 1966      60        D     54.5
## 4 1966NS-IBTS1AND1054.55.5        30         0 NS-IBTS       1  AND  H18     10 1966      60        D     54.5
## 5 1966NS-IBTS1AND1054.55.5        50         0 NS-IBTS       1  AND  H18     10 1966      60        D     54.5
## 6 1966NS-IBTS1AND1054.55.5        60         0 NS-IBTS       1  AND  H18     10 1966      60        D     54.5
## 7 1966NS-IBTS1AND1054.55.5        70         0 NS-IBTS       1  AND  H18     10 1966      60        D     54.5
##   ShootLong Depth Distance   surface
## 2       5.5    44       NA 0.1375566
## 3       5.5    44       NA 0.1375566
## 4       5.5    44       NA 0.1375566
## 5       5.5    44       NA 0.1375566
## 6       5.5    44       NA 0.1375566
## 7       5.5    44       NA 0.1375566

For the spatial correlation we need a planar coordinate system in which distances are (approximately) preserved. Therefore we transform the latitudes and longitudes to UTM coordinates. These UTM coordinates are given in meters, so we divide them by 1000 to get kilometers.

UTM <- project(cbind(full_cpue$ShootLong, full_cpue$ShootLat), "+proj=utm +zone=31 +ellps=WGS84") 

full_cpue$Xkm <- UTM[,1]/1000
full_cpue$Ykm <- UTM[,2]/1000
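As an aside, the rgdal package has since been retired from CRAN. If it is not available, the same transformation can be sketched with the sf package instead (an assumption, not the workflow used here; EPSG:32631 is WGS84 / UTM zone 31N):

```r
# Equivalent projection using sf instead of rgdal::project()
library(sf)
pts <- st_as_sf(full_cpue, coords = c("ShootLong", "ShootLat"), crs = 4326)
UTM <- st_coordinates(st_transform(pts, 32631))
full_cpue$Xkm <- UTM[, 1] / 1000
full_cpue$Ykm <- UTM[, 2] / 1000
```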

The INLA code does not like special characters in some of the variable names, like the hyphen in "NS-IBTS". Therefore we rename the survey to NSIBTS.

full_cpue <- transform(full_cpue, Survey=gsub("-","",Survey))

## Depth data

We also want to add depth to the model, and we prepare for this now. We have depth for many but not all hauls in the full CPUE set. The maximum fishing depth for IBTS standard stations is 200 m in the North Sea and 250 m in Division IIIa. However, there are some hauls taken at depths greater than 300 m, and we exclude those. The missing values are indicated by a value of -9, and these have to be turned into NAs.

summary(full_cpue$Depth)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -9.00   36.00   56.00   66.92   90.00  415.00
full_cpue <- full_cpue[full_cpue$Depth < 300,]
full_cpue[full_cpue$Depth ==-9,]$Depth <- NA
summary(full_cpue$Depth)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    6.00   37.00   58.00   68.22   91.00  295.00   35349

Given that the depth at a given location does not change much over the years, we use a generalized additive model (GAM) to model a depth map and to predict depths for those hauls where depth is missing. Using depth in INLA requires that the depth is binned; we use 2 m depth bins.

The GAM is fitted in UTM coordinates. That will be helpful when we do the INLA modelling and prediction: the grid we will be using there is in UTM coordinates.

depthmodel <- gam(Depth ~ te(Xkm,Ykm,k=20), data=full_cpue[!duplicated(full_cpue$haulCode),])
summary(depthmodel)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Depth ~ te(Xkm, Ykm, k = 20)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 68.22366    0.04639    1471   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##               edf Ref.df    F p-value    
## te(Xkm,Ykm) 365.6  366.1 1962  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Rank: 380/400
## R-sq.(adj) =  0.964   Deviance explained = 96.4%
## GCV = 58.607  Scale est. = 57.807    n = 26858
gam.check(depthmodel)

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 21 iterations.
## The RMS GCV score gradient at convergence was 0.0003061544 .
## The Hessian was not positive definite.
## Model rank =  380 / 400 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                  k'     edf k-index p-value
## te(Xkm,Ykm) 399.000 365.605   0.581       0
plot(depthmodel)

full_cpue[is.na(full_cpue$Depth),]$Depth <-predict(depthmodel,newdata= full_cpue[is.na(full_cpue$Depth),])
full_cpue$Depth <- round(full_cpue$Depth/2,0)*2

The resulting depth map can be seen by plotting the fitted values of the depth model at each observation.

Xkms <- full_cpue[!duplicated(full_cpue$haulCode),]$Xkm
Ykms <- full_cpue[!duplicated(full_cpue$haulCode),]$Ykm
preds <- predict(depthmodel,newdata= data.frame(Xkm=Xkms,Ykm=Ykms ))

plot(Xkms,Ykms, pch=20,
col=rev(tim.colors(15, alpha = 1))[cut(preds,20)], 
xlab= "Easting (km)", ylab="Northing (km)", las=1)

Finally, we store the survey data (which still has a count per haul per length) and the GAM model for depth. The data will be used in the INLA examples. The depth model will be used to predict depths when we project the INLA results on a grid.

save(full_cpue, depthmodel,file=file.path(path,"survey_data.Rdata"))